Targeted Gene Metagenomic Data Analysis ◾ 291
7.3.5.2 Using Machine Learning Classifiers
Machine learning taxonomy classifiers use a trained model, rather than sequence
alignment, to assign taxa to the representative sequences. A classifier requires a
benchmark training dataset with known taxa for model training. With QIIME2, any of the
machine learning methods available in scikit-learn can be used to train a classifier for
taxonomy assignment. However, pre-fitted classifiers are also available and can be used
instead. In the following, we will first use a pre-fitted classifier for the taxonomy
assignment, and later we will train a new model and use it as well.
For a pre-fitted classifier, we can use the “classify-sklearn” method of the
“q2-feature-classifier” plugin to assign taxa to the representative sequences that have
been obtained from clustering or denoising. First, we need to download a pre-fitted
classifier. We can use a classifier pre-trained on the 99% OTUs of the GreenGenes
database using the Naive Bayes machine learning method. The pre-fitted classifiers are
available on the QIIME2 website at “https://docs.qiime2.org/2022.2/data-resources/”.
Create the subdirectory “classifiers” and download the classifier artifact into it as
follows:
mkdir classifiers
wget -O "classifiers/gg-nb-99-classifier.qza" \
"https://data.qiime2.org/2021.11/common/gg-13-8-99-nb-classifier.qza"
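Because an interrupted or redirected download can leave an empty or non-artifact file behind, it may be worth checking the file before passing it to QIIME2. The following is a minimal sketch, not part of QIIME2: the helper name is hypothetical, and the check relies only on the fact that a “.qza” artifact is a zip archive and therefore starts with the bytes “PK”. It does not validate the contents of the archive.

```shell
# Hypothetical helper (not a QIIME2 command): succeed only if the file
# exists, is non-empty, and begins with the zip signature "PK".
is_zip_artifact() {
    [ -s "$1" ] || return 1
    [ "$(head -c 2 "$1")" = "PK" ]
}

# Usage (path taken from the download step above):
#   is_zip_artifact classifiers/gg-nb-99-classifier.qza \
#       || echo "classifier download looks incomplete" >&2
```

A file that fails this check, for example an HTML error page saved under the “.qza” name, should be downloaded again.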
Once the download has completed, use that classifier artifact as the input for the
“classify-sklearn” method together with the representative sequence artifact generated in
the clustering or denoising step. In the following, we will assign taxa to the
representative sequences generated by DADA2:
qiime feature-classifier classify-sklearn \
--i-classifier classifiers/gg-nb-99-classifier.qza \
--i-reads dada2/rep-seqs_yoga_dada2.qza \
--o-classification taxonomy/nb_tax_yoga_dada2.qza
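To inspect the assignments, the resulting taxonomy artifact can be rendered as a viewable table. The following is a minimal sketch, assuming the QIIME2 environment is active. The function names “qzv_name” and “tabulate_taxonomy” and the derived “.qzv” file name are illustrative choices, not part of QIIME2; the underlying command is “qiime metadata tabulate”.

```shell
# Hypothetical helper: derive the visualization name from the artifact name,
# e.g. taxonomy/nb_tax_yoga_dada2.qza -> taxonomy/nb_tax_yoga_dada2.qzv
qzv_name() {
    printf '%s\n' "${1%.qza}.qzv"
}

# Hypothetical wrapper around "qiime metadata tabulate".
# Returns 127 if the qiime command is not on PATH.
tabulate_taxonomy() {
    if ! command -v qiime >/dev/null 2>&1; then
        echo "qiime not found; activate the QIIME2 environment first" >&2
        return 127
    fi
    qiime metadata tabulate \
        --m-input-file "$1" \
        --o-visualization "$(qzv_name "$1")"
}

# Usage (assumes the classification step above has finished):
#   tabulate_taxonomy taxonomy/nb_tax_yoga_dada2.qza
```

The resulting “.qzv” file can be opened with “qiime tools view” to browse the taxon assigned to each representative sequence.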
Instead of using a pre-fitted one, we can train a classifier using the “feature-classifier”
plugin, which has two methods for model fitting: “fit-classifier-naive-bayes” for training
a Naive Bayes classifier and “fit-classifier-sklearn” for training any scikit-learn
classifier.
Next, we will train a Naive Bayes classifier using GreenGenes reference sequences and
then we will use the fitted classifier to assign taxa to the representative sequences generated
by a previous clustering or denoising step.
To train any classifier, we need a training dataset with known labels. In the case
of taxonomy classification, we need representative sequences with known taxa. For our
example, we can use the GreenGenes 13_8 97% OTU dataset. Remember that we downloaded
the GreenGenes database earlier and stored it in the “gg_13_8_otus” subdirectory. We will
use the representative sequences “gg_13_8_otus/rep_set/97_otus.fasta” and their
corresponding taxonomic classifications “gg_13_8_otus/taxonomy/97_otu_taxonomy.txt”. Since the